Statistical computation and visualization (MATH-517)
Wildfires are uncontrolled fires that burn in wildland vegetation, often in rural areas. They are not limited to a particular continent or environment, and they have burned many kinds of ecosystems on Earth for hundreds of millions of years (André Gabrielli (2019)). Wildfires are a concern all over the world, alongside climate change and the preservation of nature and ecosystems. There have been several major wildfire events recently, growing in number and severity: for instance, the 2020 California wildfires became one of the largest wildfire seasons in California's history (Holly Yan, Cheri Mossburg, Artemis Moshtaghian and Paul Vercammen (2020)), with several million acres burnt (Topher Gauk-Roger, Stella Chan, Jason Hanna and Steve Almasy (2020)). Turkey also went through its worst wildfire season in July and August 2021 (Mert Ozkan and Ezgi Erkoyun (2021)), and the 2019-2020 bushfires in Australia (also known as the Black Summer) killed several billion animals. Many of them belonged to endangered species, some of which were believed to have been driven to extinction by this event (Michael Slezak (2020)). Many countries therefore aim to minimize the size and number of wildfires, since they can cause many direct and indirect human fatalities (Steven Reinberg (2021)), as well as air pollution (Sarah Gibbens (2021)) and the loss of ecosystems and biodiversity (as in the Black Summer).
In order to reduce the number and severity of wildfires, it is necessary to understand the main factors behind them. Fires are very often started either accidentally (burning debris, agricultural activities, campfires, smoking) or intentionally (arson, children). Although the latter case can be prevented, human or non-human accidents can always happen and are hard to predict (Wildfire Causes (n.d.)). However, we can suspect that certain natural conditions make those accidents more likely and allow the resulting fires to grow larger. By identifying these conditions, states can build a strategy to suppress a wildfire efficiently once it starts, and prepare in advance for the locations that are most likely to catch fire at a given period of the year.
In addition, those factors change over time, creating better or worse conditions for wildfires. For instance, global warming is strongly suspected to be one of the main reasons why wildfires have become more frequent in recent years (Alejandra Borunda (2021)). Countries have been undergoing climate change, and the resulting unexpected events may be at the root of the recent disasters.
In this investigation, we set out to answer the following questions:
How has the number of fires in the United States of America evolved over time from 1993 to 2015?
How do the land covers vary over time?
What are the main factors that cause wildfires?
How are fires distributed across the land covers and meteorological factors?
To answer these questions, we proceed as follows. We start with a descriptive and visual approach, producing interactive plots that show the data along different dimensions: geographically, temporally and by factor. Afterwards, we look at the distribution of land covers. We also analyse how land covers vary at a given location, since we noticed that they changed in some areas. We then analyse the number of fires with respect to the land covers and meteorological parameters. Since we noticed that the correlation between parameters was low, we used subsets of the data and performed a quantile regression.
To perform this analysis, a dataset for the United States from 1993 to 2015 will be used. It contains 563,983 rows with 37 columns. The columns are the following:
Please note that the area proportions \(lc1\) to \(lc18\) do not always sum to exactly 1 for each pixel and month since a few classes with quasi-0 proportion have been removed.
Since the original data was provided in the context of a prediction competition with the University of Edinburgh, there are 8,000 missing values in each of the CNT and BA columns. The missing values are not necessarily located in the same rows for the two features.
When considering only rows without missing values, 452,930 rows remain.
| Statistic | Min | Pctl(25) | Median | Pctl(75) | Mean | Max |
|---|---|---|---|---|---|---|
| CNT | 0 | 0 | 0 | 2 | 2.280 | 359 |
| BA | 0 | 0 | 0 | 1.6 | 158.898 | 538,054 |
As shown in the summary, wildfires remain relatively rare events: considering the feature \(CNT\), at least half of the observations (one location in one month) record no fire, and at least 75% record at most 2 fires. The same applies to the feature \(BA\), the aggregated burnt area, whose distribution is strongly positively skewed.
As stated in the data description, the proportions of the 18 land covers do not always add up to one. Looking at figure 1 and at the summary below, we can see that the minimum value of the sum is 0.82. The first quartile also shows that only 25% of the data has a sum below approximately 0.99. We therefore consider the sums close enough to 1 and continue with the data as is.
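The check described above can be sketched as follows. This is a minimal illustration, not the code used for the report: the rows are made-up toy values with three land covers instead of the 18 columns \(lc1\) to \(lc18\), and the helper name `land_cover_sums` is hypothetical.

```python
import statistics


def land_cover_sums(rows):
    """Return the sum of land cover proportions for each row (pixel and month)."""
    return [sum(row) for row in rows]


# Toy data (assumption): each inner list holds the land cover proportions
# of one pixel for one month. The real dataset has 18 such columns.
rows = [
    [0.50, 0.30, 0.20],
    [0.60, 0.22, 0.00],
    [0.40, 0.40, 0.19],
    [0.90, 0.05, 0.05],
]

sums = land_cover_sums(rows)

# Smallest sum and quartiles of the sums, as in the summary discussed above.
smallest = min(sums)
q1, median, q3 = statistics.quantiles(sums, n=4)

print(f"min={smallest:.2f}, Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}")
```

If the first quartile of the sums is already close to 1, as reported for the real data, treating the proportions as complete is a reasonable simplification.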
Figure 1: Distribution of the sum of the 18 land cover proportions
Figure 2: Boxplots of the proportion of area covered by each land cover
Figure 2 displays the distribution of land covers using boxplots. We can see that most land covers represent less than 10% of the area considered, which means that the areas considered are very diverse.
Some transformations were applied to the features. First, the temperature was converted from Kelvin to Celsius. Next, the U-component of wind (the wind speed in the eastward direction) and the V-component of wind (the wind speed in the northward direction) were aggregated using the Euclidean norm of the vector: \[\mathit{Wspeed}=\sqrt{\mathit{Wspeed}_{East}^2 + \mathit{Wspeed}_{North}^2}\] with \(\mathit{Wspeed}\) the wind speed.
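The two transformations can be sketched in a few lines of Python. The function names and the sample values are assumptions made for illustration; they do not come from the project notebooks.

```python
import math

KELVIN_OFFSET = 273.15  # 0 degrees Celsius expressed in kelvin


def kelvin_to_celsius(temp_k: float) -> float:
    """Convert a temperature from kelvin to degrees Celsius."""
    return temp_k - KELVIN_OFFSET


def wind_speed(u_east: float, v_north: float) -> float:
    """Euclidean norm of the (eastward, northward) wind components."""
    return math.hypot(u_east, v_north)


# Made-up values for one observation (assumption, not taken from the dataset).
temp_c = kelvin_to_celsius(293.15)
speed = wind_speed(3.0, 4.0)

print(f"temperature: {temp_c:.2f} C, wind speed: {speed:.2f}")
```

Taking the norm discards the wind direction and keeps only its magnitude, which is the quantity described above.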
Before exploring the different factors, we first plotted on a map the number of wildfires (denoted \(CNT\)) and the resulting burnt area (denoted \(BA\)) over time, as a descriptive analysis of the data. We want to determine which states are the most affected by wildfires and identify when wildfires happen the most.
The given dataset contains a list of coordinates in the United States. To determine which state each coordinate belongs to, we used the Python library \(\it{reverse\_geocoder}\). This library returns the closest address for given coordinates, so we extracted the name of the state from it, then summed the values for each state and stored them in a dictionary. We denote by \(CNT_i\) or \(BA_i\) the values of the dictionary for a state i in the U.S., and by \(CNT_k\) or \(BA_k\) the value for a given coordinate k.
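The aggregation step can be sketched as below. This is an illustration under stated assumptions, not the notebook code: `aggregate_by_state` and `toy_state_of` are hypothetical names, the records are made up, and the lookup function is a toy stand-in for the reverse-geocoding call (in the real pipeline this role would presumably be played by a small wrapper around \(\it{reverse\_geocoder}\)). Skipping missing values is also an assumption.

```python
from collections import defaultdict


def aggregate_by_state(records, state_of):
    """Sum CNT and BA per state.

    records: iterable of (lat, lon, cnt, ba); cnt or ba may be None (missing).
    state_of: callable mapping (lat, lon) to a state name.
    """
    totals = defaultdict(lambda: {"CNT": 0.0, "BA": 0.0})
    for lat, lon, cnt, ba in records:
        state = state_of(lat, lon)
        # Assumption: missing values are skipped rather than imputed.
        if cnt is not None:
            totals[state]["CNT"] += cnt
        if ba is not None:
            totals[state]["BA"] += ba
    return dict(totals)


def toy_state_of(lat, lon):
    """Hypothetical stand-in for the reverse-geocoding lookup."""
    return "California" if lon < -114.0 else "Arizona"


# Made-up records: (latitude, longitude, CNT, BA).
records = [
    (36.5, -119.5, 3, 12.0),
    (37.0, -120.0, 1, None),
    (34.0, -111.5, 2, 5.5),
]

totals = aggregate_by_state(records, toy_state_of)
print(totals)
```

Passing the lookup as a function keeps the aggregation independent of the geocoding library, so it can be tested without any external data.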
To visualize the data well, we had to make several adjustments. First, we divided the number of fires and the burnt area by the total area of each state to make the states comparable. Then we multiplied the value by \(10^5\) (for \(CNT\)) or \(10^4\) (for \(BA\)) so that we get the number of wildfires per \(10^5km^2\) or the burnt area per \(10^4km^2\), respectively. We also realized that the resulting numbers range from the order of \(10^{-2}\) to \(10^{3}\), so we took their logarithm to obtain a reasonable color scale across states. The final numbers for the plots are calculated as follows:
\[Final\_CNT_i=\log_2\left(\frac{10^5\sum\limits_{k\in i}CNT_k}{total\_area\_of\_i}+1\right)\] \[Final\_BA_i=\log_2\left(\frac{10^4\sum\limits_{k\in i}BA_k}{total\_area\_of\_i}+1\right)\] for i a state in the U.S., where each sum runs over all coordinates k located in state i.
Values close to 0 would give large negative logarithms, so we added 1 before taking the log: this avoids meaningless outliers and makes the scale start at 0 rather than at large negative numbers.
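The state-level scaling can be illustrated with a short sketch. The function name `final_value` and the example state (area and totals) are hypothetical, and the state area is assumed to be expressed in \(km^2\).

```python
import math

CNT_SCALE = 1e5  # multiplier used for the number of fires
BA_SCALE = 1e4   # multiplier used for the burnt area


def final_value(total: float, area_km2: float, scale: float) -> float:
    """Compute log2(scale * total / area + 1) for one state."""
    return math.log2(scale * total / area_km2 + 1.0)


# Hypothetical state of 100,000 km^2 with made-up totals.
area = 100_000.0
final_cnt = final_value(7, area, CNT_SCALE)
final_ba = final_value(30, area, BA_SCALE)

# A state without any fire maps to 0 instead of an undefined logarithm.
no_fire = final_value(0, area, CNT_SCALE)

print(final_cnt, final_ba, no_fire)
```

Adding 1 inside the logarithm keeps every value non-negative, so the color scale starts at 0 for states without recorded fires.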
In addition, we plotted the number of fires and the burnt area at each location as red scatter points whose size follows a scale. To adjust the sizes, we used the log scale again. The numbers were obtained as follows:
\[Local\_CNT_k=4\log_2\left(CNT_k+1\right)\]
\[Local\_BA_k=2\log_2\left(BA_k+1\right)\]
for k a given coordinate of the dataset.
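As a rough illustration of what these formulas do to the marker sizes, the sketch below applies them to the extreme values reported in the summary table. The function names are hypothetical; only the two formulas come from the text.

```python
import math


def local_cnt_size(cnt: float) -> float:
    """Marker size for the number of fires at one coordinate."""
    return 4 * math.log2(cnt + 1)


def local_ba_size(ba: float) -> float:
    """Marker size for the burnt area at one coordinate."""
    return 2 * math.log2(ba + 1)


# Minimum and maximum values taken from the summary table of CNT and BA.
for cnt in (0, 2, 359):
    print(f"CNT={cnt:>3} -> size {local_cnt_size(cnt):.1f}")
for ba in (0, 1.6, 538_054):
    print(f"BA={ba:>9} -> size {local_ba_size(ba):.1f}")
```

With these formulas the largest count (359) and the largest burnt area (538,054) map to sizes of about 34 and 38 respectively, so the two modes produce markers of comparable size despite very different raw ranges.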
Several interactive maps were made in Python with the chosen scaling methods, but due to technical issues (ref (1AD)), deploying the maps with an external link was not possible. However, the interactive maps can still be run locally from the files \(\it Visualisation\_general.ipynb\) and \(\it Visualisation\_specific.ipynb\), provided that the required libraries are installed. Figures 3, 5, 6, 7 and 8 are animated plots of the interactive map for different time frames and modes (\(CNT\) or \(BA\)).
Figure 3 gives an overview of one option that can be chosen for the plot. The color scale shows the burnt area for each state, and the circle scatter plot shows the number of fires at each coordinate.